| variable | description |
|---|---|
| track_id | Song unique ID |
| track_name | Song Name |
| track_artist | Song Artist |
| track_popularity | Song Popularity (0-100) where higher is better |
| track_album_id | Album unique ID |
| track_album_name | Song album name |
| track_album_release_date | Date when the album was released |
| playlist_name | Name of playlist |
| playlist_id | Playlist ID |
| playlist_genre | Playlist genre |
| playlist_subgenre | Playlist subgenre |
| danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | Duration of song in milliseconds |
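The integer coding of `key` and the speechiness cut-offs described in the table can be turned into readable labels. The sketch below is only an illustration: the helper names `key_to_pitch()` and `speech_band()` are hypothetical, the band labels are our own wording, and the treatment of values exactly at 0.33 or 0.66 is an assumption, since the table does not say which side they fall on.

```r
# Pitch-class names for key values 0..11 (ASCII "#" and "b" stand in for sharp and flat).
pitch_names <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Hypothetical helper: translate integer keys to pitch names.
# -1 (no key detected) and any other out-of-range value become NA.
key_to_pitch <- function(key) {
  ifelse(key >= 0 & key <= 11, pitch_names[pmax(key, 0) + 1], NA_character_)
}

# Hypothetical helper: bin speechiness at the 0.33 and 0.66 cut-offs.
# Assumption: boundary values fall into the lower band.
speech_band <- function(speechiness) {
  cut(speechiness,
      breaks = c(0, 0.33, 0.66, 1),
      labels = c("music", "mixed", "speech"),
      include.lowest = TRUE)
}

key_to_pitch(c(0, 1, 2, 11, -1))
speech_band(c(0.05, 0.40, 0.90))
```

Applied to the real data, these helpers could be used inside `mutate()` on the `key` and `speechiness` columns.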
First, we load the required packages.
library(ggplot2)
library(dplyr)
library(lubridate)
library(GGally)
library(janitor)
library(here)
## here() starts at /home/rstudio/Documents/GroupAssigments#1
library(corrplot)
## corrplot 0.92 loaded
library(plotly)
library(scales)
library(ggrepel)
library(reactable)
here::i_am("Spotify_Dataset_EDA.Rproj")
Second, we load the dataset.
SpotifyData <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
## Rows: 32833 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Interactive tables
Using glimpse(): this function prints one column per line, so the features run vertically and the first few values of each feature run horizontally.
dplyr::glimpse(SpotifyData)
## Rows: 32,833
## Columns: 23
## $ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
## $ track_name <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
## $ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
## $ track_popularity <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
## $ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
## $ track_album_name <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
## $ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
## $ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
## $ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
## $ playlist_subgenre <chr> "dance pop", "dance pop", "dance pop", "dance…
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
## $ key <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
## $ mode <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
## $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
## $ duration_ms <dbl> 194754, 162600, 176616, 169093, 189052, 16304…
A summary of the dataset:
summary(SpotifyData)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
To get an overview of the dataset, we used the corrgram() function to display the correlations between the numeric features.
featuresCorr <- dplyr::select(SpotifyData , track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo ,key)
corrgram::corrgram(cor(featuresCorr),order = TRUE, upper.panel = NULL)
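A correlogram is read by eye, so it can help to list the strongest pairs numerically as well. The following is a minimal sketch on an invented toy data frame (`toy`, with made-up values), not the report's actual result. It ranks feature pairs by absolute Pearson correlation using base R only.

```r
# Toy stand-in for the numeric Spotify columns (values are invented for illustration).
set.seed(1)
energy <- runif(200)
toy <- data.frame(
  energy       = energy,
  loudness     = -20 + 15 * energy + rnorm(200, sd = 1),
  acousticness = 1 - energy + rnorm(200, sd = 0.3),
  tempo        = rnorm(200, mean = 120, sd = 25)
)

corr_mat <- cor(toy)

# Keep each pair once (upper triangle) and sort by absolute correlation.
idx <- which(upper.tri(corr_mat), arr.ind = TRUE)
ranked_pairs <- data.frame(
  feature_a = rownames(corr_mat)[idx[, "row"]],
  feature_b = colnames(corr_mat)[idx[, "col"]],
  r         = corr_mat[idx]
)
ranked_pairs <- ranked_pairs[order(-abs(ranked_pairs$r)), ]
ranked_pairs
```

With the real data, `toy` would be replaced by the selected numeric columns (here `featuresCorr`).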
We wanted to see whether release month is affected by genre.
# Drop ID-like columns, parse the release date, and count songs per month and genre.
# Release dates that are not full YYYY-MM-DD dates parse to NA and are dropped by na.omit().
TopMonth <- SpotifyData %>%
  select(-c(track_id, track_album_id, playlist_name, playlist_id, duration_ms)) %>%
  mutate(track_album_release_date = as.Date(track_album_release_date),
         month = month(track_album_release_date),
         year = year(track_album_release_date)) %>%
  select(-track_album_release_date) %>%
  # Lower-case month names, stored as a factor so the x-axis is in calendar order
  mutate(period = factor(tolower(month.name)[month], levels = tolower(month.name))) %>%
  count(period, playlist_genre) %>%
  na.omit()
ggplot(TopMonth, aes(x = period, y = n, group = playlist_genre)) +
  geom_point(aes(colour = playlist_genre)) +
  geom_line(aes(colour = playlist_genre)) +
  xlab("Month") + ylab("Number of songs") +
  ggtitle("Number of songs released during each month")
From the line chart above, we cannot conclude that there is any association between genre and release month.
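The conclusion above rests on a visual reading of the chart. One possible way to check it more formally is a chi-squared test of independence on a month-by-genre table of counts. This is a sketch under stated assumptions: the counts in `counts` are invented, and the original analysis did not run this test.

```r
# Made-up counts of songs per month (rows) and genre (columns), for illustration only.
counts <- matrix(
  c(120, 110, 130,
     90, 100,  95,
    150, 140, 160),
  nrow = 3, byrow = TRUE,
  dimnames = list(month = c("january", "february", "march"),
                  genre = c("pop", "rap", "rock"))
)

# Chi-squared test of independence: a small p-value would suggest that
# the genre mix differs between months.
# stats:: is spelled out because janitor masks chisq.test when it is attached.
indep_test <- stats::chisq.test(counts)
indep_test$p.value
```

With the real data, the table could be built from the parsed release month and `playlist_genre`, for example with `table()`.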
Are tracks with speech/words more popular than instrumental ones? To check this, we compared the median popularity of songs with high speechiness (above 0.80) with that of songs with high instrumentalness (above 0.80).
SpotifyData %>% filter(speechiness > .80) %>% pull(track_popularity) %>% median()
## [1] 54.5
SpotifyData %>% filter(instrumentalness > .80) %>% pull(track_popularity) %>% median()
## [1] 36
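A median alone does not show how many tracks fall into each group, and very few tracks may pass a 0.80 threshold. The sketch below reports the group size next to the median in one table. It runs on invented toy rows (`toy`), so the numbers it prints are not results from the Spotify data, and the group labels are our own.

```r
library(dplyr)

# Toy rows standing in for SpotifyData (values are invented).
toy <- tibble(
  track_popularity = c(58, 50, 70, 30, 35, 40, 55),
  speechiness      = c(0.85, 0.90, 0.05, 0.04, 0.03, 0.06, 0.10),
  instrumentalness = c(0.00, 0.00, 0.00, 0.90, 0.95, 0.85, 0.10)
)

# Same 0.80 thresholds as in the text; everything else is labelled "other".
summary_tbl <- toy %>%
  mutate(group = case_when(
    speechiness > 0.80      ~ "high speechiness",
    instrumentalness > 0.80 ~ "high instrumentalness",
    TRUE                    ~ "other"
  )) %>%
  group_by(group) %>%
  summarise(n = n(), median_popularity = median(track_popularity))

summary_tbl
```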
Distribution of loudness across genres
ggplot(data = SpotifyData, mapping = aes(x = playlist_genre, y = loudness, fill = playlist_genre, alpha = 0.05)) +
  geom_boxplot() +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "Playlist genre",
       y = "Loudness (dB)",
       title = "Distribution of loudness by genre (boxplot)")
Which genre is the most popular?
# A single coord_cartesian() call is used: a second call would replace the first.
ggplot(SpotifyData, aes(x = track_popularity, y = after_stat(density), color = playlist_genre)) +
  geom_freqpoly() +
  coord_cartesian(xlim = c(20, 120))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# rock genre has very interesting popularity distribution (right-hand tail)
SpotifyData %>%
  filter(playlist_genre == "rock") %>%
  filter(track_popularity > 5) %>%
  ggplot(aes(track_popularity)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# compared to pop genre
SpotifyData %>%
  filter(playlist_genre == "pop") %>%
  filter(track_popularity > 5) %>%
  ggplot(aes(track_popularity)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
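The density curves and histograms show the shape of each distribution, but a ranking is easier to read from a small summary table. The following is a sketch on invented toy data (`toy`). The 60-point cut-off for a "popular" track is an arbitrary assumption chosen for illustration, not a value from the original analysis.

```r
library(dplyr)

# Toy rows standing in for SpotifyData (values are invented).
toy <- tibble(
  playlist_genre   = c("pop", "pop", "pop", "rock", "rock", "rock", "edm", "edm"),
  track_popularity = c(70, 55, 62, 20, 80, 45, 30, 40)
)

# Per-genre size, median popularity, and share of tracks above the assumed cut-off.
genre_rank <- toy %>%
  group_by(playlist_genre) %>%
  summarise(
    n = n(),
    median_popularity = median(track_popularity),
    share_above_60 = mean(track_popularity > 60)
  ) %>%
  arrange(desc(median_popularity))

genre_rank
```

Replacing `toy` with `SpotifyData` would give the same table for the real playlists.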
Which key is used the most?
ggplot(SpotifyData) +
  geom_bar(aes(x = key), width = 0.3) +
  labs(x = "Key (pitch class)")